Abstract

We present a general construction for de- 
pendent random measures based on thinning 
Poisson processes on an augmented space. 
The framework is not restricted to dependent 
versions of a specific nonparametric model, 
but can be applied to all models that can be 
represented using completely random mea- 
sures. Several existing dependent random 
measures can be seen as specific cases of this 
framework. Interesting properties of the re- 
sulting measures are derived and the efficacy 
of the framework is demonstrated by con- 
structing a covariate-dependent latent fea- 
ture model and topic model that obtain su- 
perior predictive performance. 



1 Introduction 

Motivated by a desire for flexible models that mini- 
mize assumptions about the underlying structure of 
our data, Bayesian nonparametric models have gar- 
nered much attention in the machine learning and 
statistics communities. Most Bayesian nonparamet- 
ric models assume observations are exchangeable. In 
real life, this assumption is usually hard to justify. We 
often have side information - for example time stamps 
or geographical location - that we believe influences 
the latent structure of our data. 

There has been growing interest in models that challenge this exchangeability assumption, while still maintaining desirable properties of the original nonparametric processes. A dependent nonparametric process (MacEachern, 1999) is defined as a distribution over collections of measures indexed by values in some covariate space, such that the marginal distribution at some covariate value is described by a known nonparametric process. A number of authors have proposed dependent versions of the Dirichlet process (for example Griffin & Steel, 2006; Rao & Teh, 2009; Chung & Dunson, 2011; Duan et al., 2007; Caron et al., 2007), the Pitman-Yor process (Sudderth & Jordan, 2009) and the Indian buffet process (Ren et al., 2011; Williamson et al., 2010; Zhou et al., 2011).



Most of the basic nonparametric processes found in 
the literature can be formulated in terms of completely 
random measures (CRMs, Kingman 1967) - distribu- 
tions over measures that assign independent masses 
to disjoint subsets of the space on which they are de- 
fined. For example, the Dirichlet process can be ob- 
tained by normalizing the CRM known as the gamma 
process. The IBP can be described in terms of a mixture of Bernoulli processes, where the mixing measure is a completely random measure known as the beta process (Thibaux & Jordan, 2007).



Completely random measures on some space $\Theta$ can be represented as Poisson processes on the product space $\Theta \times \mathbb{R}_+$. In this paper, we show that a large class of dependent nonparametric processes can be described in terms of operations on Poisson processes on an augmented space $\mathcal{X} \times \Theta \times \mathbb{R}_+$. This framework offers great flexibility in the form of the dependency, and often leads to simple posterior updates, as any conjugacy present in the non-dependent version of the model is carried over to the dependent case. The resulting class of distributions contains, or is related to, several existing models, such as the kernel beta process (Ren et al., 2011) and the spatial normalized gamma process (Rao & Teh, 2009). A major contribution of this paper is to express the relationships between these models in a simple manner, and aid in the understanding of existing models and the development of new models and inference techniques.

We use this framework as a basis for two models: a 
covariate-dependent latent variable model based on 
the beta process, and a covariate-dependent topic 
model based on the gamma process. We show that 
incorporating dependency can improve the predictive 
power of Bayesian nonparametric models, and that by 
making use of conjugacy and simpler forms of depen- 
dence, we can obtain comparable results to existing 
dependent nonparametric processes, with a dramatic 
decrease in time spent on inference. 



2 Background 

A completely random measure (CRM) is a distribution over measures on some measurable space $(\Theta, \mathcal{F}_\Theta)$, such that the masses $\Gamma(A_1), \Gamma(A_2), \dots$ assigned to disjoint subsets $A_1, A_2, \dots \in \mathcal{F}_\Theta$ by a random measure $\Gamma$ are independent (Kingman, 1967). The class of completely random measures contains important distributions such as the beta process, the gamma process, the Poisson process and the stable subordinator.

A CRM on $\Theta$ is characterized by a positive Lévy measure $\nu(d\theta, d\pi)$ on the product space $\Theta \times \mathbb{R}_+$, and can be represented in terms of a Poisson process on this space. Let $\Pi = \{(\theta_k, \pi_k)\}_{k=1}^{\infty}$ be a Poisson process on $\Theta \times \mathbb{R}_+$, with mean measure $\nu(d\theta, d\pi)$. Then the completely random measure with Lévy measure $\nu(d\theta, d\pi)$ can be represented as $\Gamma = \sum_{k=1}^{\infty} \pi_k \delta_{\theta_k}$.

De Finetti's theorem tells us that any infinitely ex- 
changeable sequence can be described as a mixture of 
i.i.d. distributions. CRMs provide the mixing distri- 
bution in the de Finetti representation of a number 
of useful exchangeable distributions. For example, we 
can represent the Indian buffet process, a distribution over exchangeable binary matrices, as a beta process mixture over countably infinite collections of Bernoulli random variables (Thibaux & Jordan, 2007). The resulting distribution over exchangeable binary matrices is an appropriate prior for nonparametric versions of latent feature models. Other distributions over exchangeable matrices have been defined using the beta process (Zhou et al., 2012b) and the gamma process (Titsias, 2007; Saeedi & Bouchard-Côté, 2011) as mixing measures.

We are often interested in learning distributions over 
probability distributions - for example, for use in clus- 
tering or density estimation. Two important examples 
of such distributions - the Dirichlet process and the 
normalized stable process - can be obtained by normalizing the gamma process and the stable subordinator, respectively. CRMs can also be used directly as a prior on hazard functions in survival analysis applications (Ibrahim et al., 2005).



3 Construction of dependent random 
measures via thinned Poisson 
processes 

Let $\Pi = \{(x_k, \theta_k, \pi_k)\}_{k=1}^{\infty}$ be a Poisson process (PP) on the space $\mathcal{X} \times \Theta \times \mathbb{R}_+$. This space has three components: $\mathcal{X}$, an auxiliary space; $\Theta$, a space of parameter values; and $\mathbb{R}_+$, which will be the masses making up the random measures. Let the mean measure of $\Pi$ be described by the positive Lévy measure $\nu(dx, d\theta, d\pi)$.

While the theory herein applies for any such Lévy measure, we will focus on the class of Lévy measures that factorize as

$$\nu(dx, d\theta, d\pi) = G(dx, d\theta)\,\nu_0(d\pi).$$

This corresponds to the class of homogeneous completely random measures, where the size of an atom is independent of its location in $\Theta$ and $\mathcal{X}$.

It follows that $\Gamma = \sum_{k=1}^{\infty} \pi_k \delta_{(x_k, \theta_k)}$ is a CRM on $\mathcal{X} \times \Theta$. By the mapping theorem of PPs (Kingman, 1993) we see that $B = \sum_{k=1}^{\infty} \pi_k \delta_{\theta_k}$ is a CRM on $\Theta$ with rate measure given by

$$\nu_B(d\theta, d\pi) = \int_{\mathcal{X}} \nu(dx, d\theta, d\pi) = \nu_0(d\pi) \int_{\mathcal{X}} G(dx, d\theta).$$



Let $\mathcal{T}$ be some covariate space - for example time - and let $\{p_x : \mathcal{T} \to [0,1]\}_{x \in \mathcal{X}}$ be a collection of functions indexed by $x \in \mathcal{X}$. We can now construct a family of random measures $B_t$ dependent on values $t \in \mathcal{T}$. For each point $(x_k, \theta_k, \pi_k) \in \Pi$, define a collection $\{r_k^t\}_{t \in \mathcal{T}}$ of Bernoulli random variables (so $r_k$ is a binary-valued random function on $\mathcal{T}$), such that $p(r_k^t = 1) = p_{x_k}(t)$. The $r_k^t$s indicate whether atom $k$ in the global measure $B$ appears in the local measure $B_t$ at covariate value $t$. Therefore, the function $p_x$ controls the degree of dependence between two measures $B_t$ and $B_{t'}$.



Appealing to the marking theorem of PPs (Kingman, 1993), we see that the resulting thinned PP $\Pi_t$ and its associated rate measure $\nu_t$ are described by

$$\Pi_t = \{(x_k, \theta_k, \pi_k) : r_k^t = 1\}_{k=1}^{\infty}, \qquad \nu_t(A, d\theta, d\pi) = \int_{x \in A} p_x(t)\,\nu(dx, d\theta, d\pi)$$

for $A \in \mathcal{F}_{\mathcal{X}}$. Then, applying the mapping theorem to $\Pi_t$ and employing the sum form of a CRM, we find

$$B_t = \sum_{k : r_k^t = 1} \pi_k \delta_{\theta_k} = \sum_{k=1}^{\infty} r_k^t \pi_k \delta_{\theta_k}$$

is a CRM on $\Theta$ that varies with $t \in \mathcal{T}$ and has rate measure

$$\nu_{B_t}(d\theta, d\pi) = \int_{\mathcal{X}} p_x(t)\,\nu(dx, d\theta, d\pi) = \nu_t(\mathcal{X}, d\theta, d\pi) \qquad (1)$$

which, given certain forms of $p_x$ and $\nu$, may be simplified further. We refer to $\{B_t\}_{t \in \mathcal{T}}$ as a thinned CRM.

If the thinning function $p_x(t)$ is taken to be a bounded unimodal kernel function $K(t, m, \phi)$, where $x := (m, \phi)$ gives the center and dispersion of the kernel, we can interpret the model as saying that each atom $\pi_k \delta_{(x_k, \theta_k)}$ of the CRM defined on $\mathcal{X} \times \Theta \times \mathbb{R}_+$ is "active" in some subregion of $\mathcal{T}$, dictated by a location $m_k$ and a dispersion $\phi_k$. However, $p_x$ need not be unimodal, or even a kernel. Later, we will consider a form for $p_x$ that allows atoms to be active at multiple locations.
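The construction above can be illustrated with a small simulation. The following is only a minimal sketch under stated assumptions, not the implementation used in this paper: the global measure is approximated by a finite number of gamma-distributed atoms, the covariate space is an interval of the real line, the kernel is a Gaussian bump bounded by 1, and the indicators are drawn independently at each covariate value. The names `sample_thinned_crm`, `measure`, `gamma_mass` and `width` are hypothetical.

```python
import math
import random


def sample_thinned_crm(K, covariates, gamma_mass=5.0, width=0.5, seed=0):
    """Finite sketch of a kernel-thinned CRM on a 1-D covariate space.

    Assumptions (not from the text): a K-atom gamma approximation to the
    global measure, kernel centres m_k ~ Uniform(0, 10), parameters
    theta_k ~ Uniform(0, 1), and a Gaussian-bump kernel bounded by 1.
    """
    rng = random.Random(seed)
    atoms = []
    for _ in range(K):
        pi_k = rng.gammavariate(gamma_mass / K, 1.0)  # atom mass
        theta_k = rng.random()                        # atom location in Theta
        m_k = rng.uniform(0.0, 10.0)                  # kernel centre (auxiliary x_k)
        atoms.append((m_k, theta_k, pi_k))

    def p(m, t):
        # Thinning probability p_x(t): largest at t = m, decaying with distance.
        return math.exp(-((t - m) ** 2) / (2.0 * width ** 2))

    # r[t][k] = 1 if atom k is kept in the local measure B_t.
    r = {
        t: [1 if rng.random() < p(m, t) else 0 for (m, _, _) in atoms]
        for t in covariates
    }
    return atoms, r


def measure(atoms, r_t, lo, hi):
    """B_t([lo, hi)) = sum_k r_k^t * pi_k * 1[theta_k in [lo, hi)]."""
    return sum(rk * pi for rk, (_, th, pi) in zip(r_t, atoms) if lo <= th < hi)
```

Passing a vector of ones as `r_t` recovers the mass of the (approximate) global measure, so each local measure can be compared against it directly.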



3.1 Properties of thinned CRMs

The moments of $B_t(A)$ for any $t \in \mathcal{T}$ and $A \in \mathcal{F}_\Theta$ can be determined from Campbell's theorem (Kingman, 1993) using Eq. 1. Another quantity of interest is the correlation between the marginals of a thinned CRM at two covariate values $t$ and $t'$. Assuming that $V(\pi_k) < \infty$, which holds for most CRMs used in practice, we have

$$\mathrm{Corr}(B_t(A), B_{t'}(A)) = \frac{\sum_{k : \theta_k \in A} E(r_k^t r_k^{t'})\, V(\pi_k)}{\sqrt{\sum_{k : \theta_k \in A} E((r_k^t)^2)\, V(\pi_k)}\ \sqrt{\sum_{k : \theta_k \in A} E((r_k^{t'})^2)\, V(\pi_k)}}$$

where [...] $= (p_{x_k}(t))_{k=1}^{\infty}$. In other words, the correlation between the two random measures is given by the correlation between the thinning indicators $r^t$ and $r^{t'}$, and is independent of the Lévy measure. We can therefore specify the correlation between the measures at different covariate values through the form of $p_x$. In general, smooth functions will capture the intuitive notion that measures at nearby covariates should use a similar set of atoms. Arbitrary correlation structures can be obtained via appropriate choice of $p_x$.

A key property of the construction is that the resulting dependent random measures are of the same form as the original process - the component $\nu_0$ of the Lévy measure that governs the atom sizes is unchanged, and the $\pi_k$s are distributed as before. This is desirable in order to retain conjugacy in the model being used. The thinned CRM framework puts very few restrictions on the original process, allowing us to construct a large family of dependent CRMs, whereas previous constructions have been limited to specific processes.

4 Examples

In this section, we describe thinned CRMs with two different dependency structures, and two hierarchical models based on such thinned CRMs.

4.1 A single-location thinned CRM

One of the simplest forms of covariate dependency is to assume that the expected correlation between two measures decreases with increasing distance in covariate space. This can be captured by choosing the thinning probability for each atom of the global CRM to be a unimodal distribution centered on a point in covariate space - as we move away from this location in covariate space, the probability of a covariate-dependent measure featuring this atom will decrease monotonically. Such a model is described as:

$$\Gamma := \sum_{k=1}^{\infty} \pi_k \delta_{(x_k, \theta_k)} \sim \mathrm{CRM}(\nu(dx, d\theta, d\pi))$$
$$p_x(t) = f(|x - t|)$$
$$r_k^t \sim \mathrm{Ber}(p_{x_k}(t)) \qquad (2)$$

where $\mathcal{X} = \mathcal{T}$ and $f : \mathcal{X} \to [0, 1]$ is some unimodal function on $\mathcal{X}$, for example a scaled Gaussian density.

4.2 A multiple-location thinned CRM


The form of dependency in Section 4.1 is restrictive: 
the probability of an atom contributing to a CRM de- 
cays with distance in covariate space. If each atom 
corresponds to a feature in a latent factor model, this 
means that, in practice, each feature is only going to 
contribute to data points within a restricted covariate 
range. 

Greater flexibility can be obtained by replacing the unimodal function $f$ in Eq. 2 with an arbitrary function $g$. The function $g$ might, for example, be a Gaussian random field on $\mathcal{T}$ that has been transformed via a sigmoid function at every value of $t \in \mathcal{T}$.

As a concrete example, consider one such construction of a covariate-dependent beta process. Here, the base CRM is a homogeneous beta process, with Lévy measure $\nu_B(dx, d\theta, d\pi) = c\,H(dx)\,B_0(d\theta)\,\pi^{-1}(1 - \pi)^{c-1} d\pi$ on $\mathcal{X} \times \Theta \times [0, 1]$, for some constant $c > 0$ and probability measures $H$ and $B_0$.

For the thinning function, we choose a transformed relevance vector machine kernel. The relevance vector machine (Tipping, 2001) can be seen as the weighted sum of (a finite number of) Gaussian kernels. Locations in the auxiliary space $\mathcal{X}$ correspond to the set of centers, weights and widths of these kernels. A standard modeling decision, which we adopt in our experiments, is to fix the centers of these kernels to the $L$ locations $t_1, \dots, t_L$ of the data in covariate space $\mathcal{T}$. Each location $x_k \in \mathcal{X}$ therefore corresponds to a set of $L + 1$ weights $\omega_{lk} \in \mathbb{R}$, and a (shared) width $\phi_k$ selected from a fixed dictionary $D$. Our auxiliary space is therefore defined as $\mathcal{X} := \mathbb{R}^{L+1} \times D$, and our base measure $H(dx)$ can be decomposed into a normal-inverse gamma prior on each of the weights, and a categorical prior on the widths. In our experiments, we chose small values of $c_0$ and $d_0$ (see below), resulting in most $\omega_{lk}$ being small, which implies that $p_{x_k}(t)$ will be large at few locations.

In order to ensure valid thinning probabilities, we transform the RVM kernel pointwise using a probit function. The generative procedure is given by:

$$\Gamma := \sum_{k=1}^{\infty} \pi_k \delta_{(x_k, \theta_k)} \sim \mathrm{CRM}(\nu_B(dx, d\theta, d\pi))$$
$$\omega_{lk} \sim \mathrm{NiG}(0, c_0, d_0), \qquad \phi_k \sim \mathrm{Cat}(\phi_1, \dots, \phi_D)$$
$$p_{x_k}(t) = \Phi\Big(\omega_{0k} + \sum_{l=1}^{L} \omega_{lk} \exp(-\phi_k \|t - t_l\|_2^2)\Big)$$
$$r_k^t \sim \mathrm{Ber}(p_{x_k}(t)) \qquad (3)$$
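To make the probit-transformed RVM thinning function concrete, the sketch below evaluates the thinning probability for a scalar covariate. It is illustrative only: the function and argument names (`probit`, `rvm_thinning_prob`, `bias`, `weights`, `width`) are hypothetical, the weights are supplied by the caller rather than drawn from a normal-inverse gamma prior, and the probit function is computed with the error function from the standard library.

```python
import math


def probit(u):
    """Standard normal CDF, Phi(u), written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))


def rvm_thinning_prob(t, centers, bias, weights, width):
    """Phi(bias + sum_l weights[l] * exp(-width * (t - centers[l])**2)).

    `bias` plays the role of the intercept weight, `weights` the kernel
    weights, `centers` the fixed kernel locations, and `width` the shared
    kernel width for this atom. Scalar covariates are assumed.
    """
    if len(centers) != len(weights):
        raise ValueError("one weight is needed per kernel centre")
    activation = bias + sum(
        w * math.exp(-width * (t - t_l) ** 2)
        for w, t_l in zip(weights, centers)
    )
    return probit(activation)


# One non-zero weight: the atom is likely to be active only near t = 3.
centers = [1.0, 2.0, 3.0, 4.0, 5.0]
weights = [0.0, 0.0, 3.0, 0.0, 0.0]
p_near = rvm_thinning_prob(3.0, centers, -2.0, weights, 1.0)
p_far = rvm_thinning_prob(10.0, centers, -2.0, weights, 1.0)
```

With a negative intercept and sparse weights, the probability is close to zero except near the centres that carry a large weight, which matches the behaviour described for small weights above. Several non-zero weights would give an atom that is active at multiple locations.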



4.3 A dependent latent feature model 



We can use covariate-dependent CRMs to construct covariate-dependent latent variable models. In such a setting, each atom of the CRM on $\mathcal{X} \times \Theta \times \mathbb{R}_+$ is associated with a latent feature, and the mass of that atom parameterizes a distribution over the weight assigned to that feature. Each data point then selects a weight for each feature according to the masses of the atoms in the corresponding thinned CRM.

As an example, consider a latent feature model based on the covariate-dependent beta process described in Eq. 3, where $B_0(d\theta)$ is a multivariate Gaussian prior measure - i.e. each latent feature is a real-valued vector. For each covariate value $t \in \mathcal{T}$, a subset of these features, and their corresponding atom weights $\pi_k$, are selected as in Eq. 3 to give a local measure $B_t = \sum_{k=1}^{\infty} r_k^t \pi_k \delta_{\theta_k}$. For each data point $n$ at covariate value $t$, a subset of features is chosen by selecting each feature with probability $r_k^t \pi_k$. The selected features are combined via linear superposition, and Gaussian noise is added. The generative model is as follows:

$$z_{n,t}^k \sim \mathrm{Ber}(r_k^t \pi_k), \qquad t \in \mathcal{T},\ k \in \mathbb{N},\ n \in \{1, \dots, N_t\}$$
$$A_k \sim \mathcal{N}(0, \sigma_A^2 I), \qquad k \in \mathbb{N}$$
$$y_{n,t} \sim \mathcal{N}\Big(\sum_{k} z_{n,t}^k A_k,\ \sigma^2 I\Big)$$

where $r_k^t$ and $\pi_k$ are sampled according to Eq. 3 and $N_t$ denotes the number of data points with covariate $t$. In the case where each observation is associated with a unique covariate we simplify the notation to $z_n^k$ and $y_n$.
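The generative steps for a single covariate value can be sketched as follows. This is an illustrative sketch rather than the code used in the experiments: it assumes the thinning indicators, atom masses and feature vectors have already been drawn, treats the model as truncated to a finite number of features, and uses hypothetical names (`sample_latent_feature_data`, `noise_sd`).

```python
import random


def sample_latent_feature_data(r_t, pi, features, n_points, noise_sd, rng):
    """Draw n_points observations at one covariate value t.

    r_t[k]      : thinning indicator for feature k at this covariate (0 or 1)
    pi[k]       : atom mass of feature k, a probability in [0, 1]
    features[k] : feature vector for feature k (list of floats)
    """
    n_features = len(pi)
    dim = len(features[0])
    Z, Y = [], []
    for _ in range(n_points):
        # A feature can only be used if it survived thinning at t.
        z = [1 if rng.random() < r_t[k] * pi[k] else 0 for k in range(n_features)]
        # Linear superposition of the selected features.
        mean = [
            sum(z[k] * features[k][d] for k in range(n_features))
            for d in range(dim)
        ]
        # Independent Gaussian observation noise on every coordinate.
        y = [m + rng.gauss(0.0, noise_sd) for m in mean]
        Z.append(z)
        Y.append(y)
    return Z, Y
```

Features whose indicator is zero at a given covariate value are never selected there, whatever their mass, which is how the thinning induces covariate dependence in feature usage.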

This model is a dependent version of the linear Gaussian IBP model proposed by Griffiths & Ghahramani (2005). The model can be extended by using different models for generating and combining the features (Wood et al., 2006; Miller, 2011), or by sampling a real-valued weight $s_{n,t}^k$ for each instance of a feature and combining them as $y_{n,t} \sim \mathcal{N}\big(\sum_{k} z_{n,t}^k s_{n,t}^k A_k,\ \sigma^2 I\big)$ (Zhou et al., 2012a).



4.4 A time-varying topic model 

Topic models are popular latent variable models that decompose a text corpus into the underlying topics. Topic models define a topic as a probability distribution over a finite vocabulary with $P$ terms. The simplest topic model is latent Dirichlet allocation (LDA, Blei et al., 2003) where the words in each document are generated by first drawing which topic the word exhibits from a document-specific topic distribution and then drawing the actual word from the corresponding topic. The basic LDA model has been extended in many ways, for example, to allow correlated topics (Blei & Lafferty, 2007; Paisley et al., 2011), to allow the topics to drift over time (Blei & Lafferty, 2006; Wang et al., 2008) and to allow topic usage to vary over time (Wang & McCallum, 2006).



The thinned CRM construction described in this paper can be used to construct a time-varying topic model where the topics are assumed fixed, but the usage of the topics changes over time. This assumption allows the learned topics to be localized in time. As in Zhou et al. (2012b), we formulate our topic model as a Poisson factor model.

We use a thinned gamma process (tGaP) to model the global popularity of each topic and the relevance vector machine in Eq. 3 as the thinning function. Let $w_{pnt}$ denote the number of times the $p$th word (in a vocabulary of $P$ words) appears in the $n$th document at time $t$. Let $\nu_G(dx, d\theta, d\pi) = \nu_{G_0}(d\pi)\,H(dx)\,B_0(d\theta)$, where $\nu_{G_0}(d\pi) = \gamma \pi^{-1} \exp(-\lambda \pi)\, d\pi$ is the Lévy measure of the gamma process; $B_0(d\theta)$ is the $P$-dimensional Dirichlet distribution with parameter $\alpha_0$; and $H(dx)$ is the prior over parameters for the RVM as described in Section 4.2.



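The complete model, denoted tGaP-PFA, is specified as

$$\Gamma := \sum_{k=1}^{\infty} \pi_k \delta_{(x_k, \theta_k)} \sim \mathrm{CRM}(\nu_G(dx, d\theta, d\pi))$$
$$p_{x_k}(t) = \Phi\Big(\omega_{0k} + \sum_{l=1}^{L} \omega_{lk} \exp(-\phi_k \|t - t_l\|_2^2)\Big)$$
$$G_{n,t} := \sum_{k=1}^{\infty} r_k^t \pi_k \varphi_k^{n,t} \delta_{\theta_k}$$
$$\varphi_k^{n,t} \sim \mathrm{Ga}(\epsilon, 1), \qquad n = 1, \dots, N_t,\ k \in \mathbb{N}$$
$$w_{pntk} \sim \mathrm{Pois}(\theta_{kp}\, r_k^t\, \pi_k\, \varphi_k^{n,t})$$
$$w_{pnt} = \sum_{k=1}^{\infty} w_{pntk} \qquad (4)$$

where the RVM machinery is presented in Eq. 3. The vectors $\theta_k = (\theta_{k1}, \dots, \theta_{kP})$ are the topics, the atom sizes $\pi_k$ can be interpreted as the baseline rate at which this topic generates words, and the $\varphi_k^{n,t}$ are document-specific modulations of the global rate so that documents can exhibit a topic more or less than the baseline.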
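The way word counts arise from thinned topics can be sketched for a single document as below. This is a hedged illustration under several assumptions: the model is truncated to a finite number of topics, the document-specific modulations are drawn from a gamma distribution with shape `eps` and unit scale, and Poisson draws use Knuth's multiplication method, which is only suitable for small rates. The names `poisson`, `sample_document_counts`, `eps` and `phi` are hypothetical.

```python
import math
import random


def poisson(rate, rng):
    """Poisson draw by Knuth's multiplication method (fine for small rates)."""
    if rate <= 0.0:
        return 0
    limit = math.exp(-rate)
    k = 0
    prod = rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def sample_document_counts(topics, pi, r_t, eps, rng):
    """Word counts for one document at a covariate value with indicators r_t.

    topics[k][p] : probability of word p under topic k
    pi[k]        : baseline rate (atom mass) of topic k
    r_t[k]       : thinning indicator of topic k at this covariate value
    """
    n_topics, vocab = len(topics), len(topics[0])
    # Document-specific modulation of each topic's baseline rate.
    phi = [rng.gammavariate(eps, 1.0) for _ in range(n_topics)]
    counts = [0] * vocab
    for k in range(n_topics):
        if r_t[k] == 0:
            continue  # a thinned topic contributes rate zero to every word
        for p in range(vocab):
            counts[p] += poisson(topics[k][p] * pi[k] * phi[k], rng)
    return counts
```

Because per-topic counts are summed, the total count for each word is again Poisson with rate equal to the sum of the active topics' rates, which is the property a Poisson factor model relies on.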



5 Relationship with other processes 

The framework for dependent random measures proposed in Section 3 is very general, and includes or is related to a number of existing models, as we describe below.

5.1 Kernel beta process

The kernel beta process (KBP, Ren et al., 2011) has an interesting interpretation in terms of thinned CRMs. Let $\mathcal{T}$ be our covariate space, $\Psi$ be a space of possible dispersions, and define our auxiliary space as $\mathcal{X} := \mathcal{T} \times \Psi$. Let $\nu(dx, d\theta, d\pi) := \nu_0(d\pi)\,H(dx)\,B_0(d\theta)$ be a Lévy measure on $\mathcal{X} \times \Theta \times \mathbb{R}_+$, such that $\nu_0(d\pi) = c\,\pi^{-1}(1 - \pi)^{c-1} d\pi$. Let $p_x(t) = K(t, m, \psi)$ for every $x := (m, \psi) \in \mathcal{X}$, where $K(\cdot, \cdot, \cdot)$ is a unimodal kernel with mean $m$ and dispersion $\psi$, bounded above by 1. Then, the expectation (over the thinning indicators) of a realization of the corresponding thinned CRM is given by

$$E\Big[\sum_{k=1}^{\infty} r_k^t \pi_k \delta_{\theta_k}\Big] = \sum_{k=1}^{\infty} K(t, m_k, \psi_k)\, \pi_k \delta_{\theta_k}$$

which is exactly the form of the KBP. In other words, the KBP is a mixture of kernel-thinned beta processes. The thinned beta process provides a generative process for the KBP, which could be useful for formulating the KBP in a probabilistic programming language (e.g. Goodman et al., 2008). In fact, the inference algorithm described in Ren et al. (2011) uses such a representation.

While the KBP can be described without using a thinned Poisson process representation, we feel that the above derivation is easier to interpret, and properties of the KBP can be simply derived by appealing to well-known properties of marked Poisson processes. Additionally, the thinned Poisson process construction makes extending the KBP idea to arbitrary CRMs simple, whereas the original construction relied on specific properties of the beta process.

5.2 Spatial normalized gamma processes

Just as a Dirichlet process can be constructed by normalizing a gamma process, a dependent Dirichlet process can be constructed by normalizing a thinned gamma process. If $\mathcal{X} = \mathcal{T}$ and the thinning probability is given by $p_{x_k}(t) = \int_0^{\infty} I[\,|t - x_k| < l\,]\, f(l)\, dl$ for some distribution $f(\cdot)$ over window size $l$, then after normalization, this describes the spatial normalized gamma processes (SN$\Gamma$P) of Rao & Teh (2009), where a Dirichlet process $D_t$ exists at all covariate values.

Incorporating the Poisson process representation into the SN$\Gamma$P model could yield a number of benefits. The inference scheme described by Rao & Teh (2009) involves representing each $D_t$ as a mixture of independent DPs, and performing inference in the corresponding mixture of urn schemes. However, this approach does not scale well to higher dimensional covariate spaces, as the number of independent regions, and thus the number of DPs that need to be represented, grows exponentially with the dimensionality of the space. In addition, a naive MCMC implementation mixes poorly, necessitating the use of expensive split/merge Metropolis-Hastings steps. Representing the SN$\Gamma$P as a normalized thinned CRM opens up the possibility of different inference algorithms that represent $D_t$ explicitly, and may yield more scalable and efficient implementations.

In addition, understanding the model in terms of 
thinned Poisson processes makes it easier to change 
the form of dependency from the box kernel employed 
in the original paper. Alternative kernels, for example 
an exponential kernel, could be used to give a softer 
falloff and hence more flexible dependency. 
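The window-based thinning probability of this section is the probability that a random window size exceeds the distance between the covariate value and the atom location, so it can be evaluated from the survival function of the window-size distribution. The sketch below is an illustration under assumptions, not part of the original model specification: window sizes are taken to be exponentially distributed, which gives an exponentially decaying thinning probability, and a Monte Carlo estimate is included as a sanity check. All function names are hypothetical.

```python
import math
import random


def window_thinning_prob(t, x_k, survival):
    """P(window size > |t - x_k|), where survival(d) = P(window size > d)."""
    return survival(abs(t - x_k))


def exponential_survival(rate):
    """Survival function of an exponential window-size distribution (assumed)."""
    return lambda d: math.exp(-rate * d)


def monte_carlo_prob(t, x_k, sample_window, n, rng):
    """Estimate the same probability by sampling window sizes directly."""
    hits = sum(1 for _ in range(n) if abs(t - x_k) < sample_window(rng))
    return hits / n


rng = random.Random(0)
exact = window_thinning_prob(1.0, 0.0, exponential_survival(2.0))
approx = monte_carlo_prob(1.0, 0.0, lambda g: g.expovariate(2.0), 20000, rng)
```

Swapping in a different survival function changes the shape of the dependency without touching the rest of the construction.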

5.3 Ornstein-Uhlenbeck Dirichlet process 

The Ornstein-Uhlenbeck Dirichlet process (OUDP, Griffin, 2007) can be constructed in a similar manner to the KBP, with an added normalization step. We define $\mathcal{T} = \mathbb{R}$ and $\mathcal{X} := \mathbb{R} \times \Psi$, and let $\nu(dx, d\theta, d\pi) := \nu_0(d\pi)\,H(dx)\,B_0(d\theta)$ be a Lévy measure on $\mathcal{X} \times \Theta \times \mathbb{R}_+$ and $\Gamma$ the associated CRM. Let $\{G_t\}$ be the dependent CRM obtained when we choose the thinning probability $p_{x_k}(t) = K(t, m_k, \psi_k)$ to be an Ornstein-Uhlenbeck kernel, and $\nu_0(d\pi)$ to be the Lévy measure of a gamma process. The OUDP is then obtained as $D_t(A) = E_p\big[G_t(A) / G_t(\Theta)\big]$. For the Ornstein-Uhlenbeck kernel, Griffin showed that $D_t$ is a DP. Unfortunately, the proof does not easily extend to arbitrary kernels or higher dimensional spaces. However, treating $D_t$ as a mixture of normalized thinned CRMs immediately shows that if we instead consider the "complete representation" of the random measures $\{D_t\}_{t \in \mathcal{T}}$, then the marginal distributions are Dirichlet processes, regardless of the kernel we pick and of the covariate space.

5.4 Other dependent nonparametric 
processes with Poisson process 
interpretations 

In addition to models that can be represented in terms of thinned Poisson processes on an augmented space, there are a number of other models that can be described in terms of Poisson processes. Lin et al. (2010) create a Markov chain $\Pi_1, \Pi_2, \dots$ of Poisson processes on $\Theta \times \mathbb{R}_+$ such that, at each time point, the corresponding Poisson process is obtained by thinning the previous Poisson process and superimposing an independent Poisson process. The resulting chain of Poisson processes defines a Markov chain of gamma processes, which are normalized to give a Markov chain of Dirichlet processes. A similar procedure, with thinning probability that depends on the atom sizes, underlies the size-biased deletion form of the dependent Pólya urn model of Caron et al. (2007). These models do not fit neatly into the framework described in this paper, and are restricted to Markovian dependency and discrete covariate spaces.

6 Experimental evaluation 

We illustrate the effectiveness of using a thinned CRM to relax the assumption of exchangeability in nonparametric Bayesian modeling on both synthetic and real data. Specifically, we consider the approaches described in Sections 4.3 and 4.4, where a probit RVM thinned CRM is used as the basis for covariate-dependent binary latent feature models and covariate-dependent topic models, respectively. The experiments using binary latent feature models are performed to allow comparisons to existing work and to show that the proposed prior is indeed capturing the structure, and not the likelihood. The topic model experiments are intended to highlight the ease with which we can incorporate dependency into more complex hierarchical models, and demonstrate the performance gains such covariate dependency can yield.
Figure 1: Recovering feature probabilities in synthetic data. Top row: Features used to generate synthetic data. Second row: Time-varying thinning probabilities used to generate synthetic data. Third row: Recovered features (manually aligned to match generating features). Bottom row: Recovered time-varying thinning probabilities.

6.1 Inference

Inference in the probit RVM model described in Section 4.2 is carried out using Gibbs sampling. We consider a truncated version of the beta process and gamma process for computational simplicity. In both cases, we select a truncation level $K$. To approximate the beta process, we draw $K$ atom sizes from a $\mathrm{Beta}(\dots, 1 - \dots)$ distribution. The resulting $K$-dimensional vector can be shown to converge to a draw from a beta process as $K \to \infty$ (Paisley & Carin, 2009). Similarly, to approximate the gamma process, we draw $K$ atom sizes from a $\mathrm{Ga}(\dots, 1)$ distribution.

The weights $\{\omega_{lk}\}$ can be sampled using the method of Albert & Chib (1993), using the $r_k^t$ of Eq. 3 as observations. To allow conjugate updates for $\pi_k$ we introduce an auxiliary variable $b_{n,t}^k$ for each data point and feature such that $z_{n,t}^k = 1$ iff $b_{n,t}^k = 1$ and $r_k^t = 1$. We then sample $r_k^t$ and $b_{n,t}^k$ from their joint distribution, which can be enumerated since both variables are binary. A similar scheme was used in Ren et al. (2011). Lastly, we sample the kernel dispersion parameters from a fixed finite dictionary of possible values with a uniform prior over the possible values.

For the latent feature model described in Section 4.3, the remaining Gibbs sampling equations are all easily derived by conjugacy, and Zhou et al. (2012a) provides most of the required distributions. For the topic model described in Section 4.4, the Gibbs sampling equations for the remaining variables are described in the supplement.

6.2 Dependent binary latent feature model: Synthetic data



To demonstrate the model's ability to uncover covariate-dependent structure we generated synthetic data similarly to the "bag of items" experiment in Williamson et al. (2010). Here, the covariate space is the real line, and the covariate values are the integers $1, \dots, 20$. The data was generated using eight 64-pixel image features (depicted in the top row of Fig. 1), and eight corresponding time-varying thinning probabilities generated using the RVM kernel described in Section 4.2 with kernel weights $\omega_{lk} \sim \kappa_k \delta_0 + (1 - \kappa_k)\mathcal{N}(0, 4)$, $\kappa_k \sim \mathrm{Beta}(1, 1)$. Each kernel had dispersion parameter $\phi$, implying that all features vary on the same scale. The resulting thinning probabilities $p_{x_k}(t)$ are shown in the second row of Fig. 1. For each location $t \in \{1, \dots, 20\}$ we generate a binary matrix $Z^t = \{z_{n,t}^k\} \in \{0, 1\}^{100 \times 8}$ of feature usage indicators for 100 data points at location $t$ using the sampling equation for $z_{n,t}^k$ in Section 4.3. Finally, we generate data for each $t$ as $Y^t = Z^t A + E$, where the rows of $A$ are the eight features and $E$ is a matrix of observation noise with each entry normally distributed with mean 0 and variance 0.25.

We perform inference using the Gibbs sampler described above, with a truncation level of 20 features, and learned individual scales for each kernel. The resulting learned features and their respective thinning probabilities are depicted in the third and fourth rows of Fig. 1. The model usually learns the correct dimensionality of the data and thinning probabilities, as in the case depicted; however, sometimes extra features are used to explain the noise present in the data in addition to the correct features.

Table 1: RMSE on UN developmental data

Exchangeable    thinned BP     dIBP
1.02 ± 0.08     0.85 ± 0.11    0.73 ± 0.05

6.3 Dependent binary latent feature model: 
U.N. development indicators 

We evaluate the model in a predictive setting on a UN dataset consisting of 15 developmental indicators for 144 countries. This dataset was used by Williamson et al. (2010) to evaluate an alternative dependent latent feature model known as the dependent IBP (dIBP). The dIBP induces dependency directly between corresponding elements of a collection of binary vectors using a transformed Gaussian process.

We follow the experimental protocol used in Williamson et al. (2010), where 14 countries are selected at random as a test set, and the model is trained on the remaining 130 countries. For each test country we observe a single feature chosen at random (possibly a different feature for each country) and the goal is to predict the remaining 14 unobserved features. The covariate for the thinned beta process model is log-GDP of the country. We perform 10-fold cross-validation and report the mean RMSE and two standard deviations in Table 1, where we compare the results for an exchangeable beta-Bernoulli process feature model, the thinned beta process model and the dIBP model.

The thinned beta process model obtained lower RMSE 
than the exchangeable model on all folds, indicat- 
ing that incorporating covariate information improves 
modeling performance. The best results are obtained 
by the dIBP. This is not surprising, because the Gaus- 
sian processes used are flexible enough to model arbi- 
trary changes in the latent structure. However, this 
added performance comes at a cost - the dIBP uses a 
single Gaussian process for each latent feature, and 
inference in the Gaussian processes scales cubically 
with the number of covariate locations. In addition, 
the dIBP does not make use of conjugacy, which increases the computational costs. While the exchangeable model and the thinned BP models ran on the order of hours, the dIBP ran on the order of days. We feel the thinned BP provides a compromise between running time and the improved accuracy gained by taking covariate information into account.

6.4 Time-varying topic model 

We evaluate the time-dependent topic model proposed in Section 4.4 both quantitatively and qualitatively on the State of the Union dataset, which consists of the full texts of the addresses for presidents George Washington to George Bush covering the years 1780-2002. As in Wang & McCallum (2006) we break up the addresses into documents of three paragraphs. This resulted in 5997 documents. We created our vocabulary by computing the term-frequency inverse document frequency (TFIDF, Manning et al., 2008) score of all observed words and only keeping those with at least 10 occurrences in the corpus and in the upper 0.15 quantile of the observed TFIDF scores, resulting in a vocabulary with 997 words. All results are reported for the Dirichlet parameter $\alpha_0 = 0.05$, with comparable results obtained with other values. Large values of $\alpha_0$ result in few topics being learned and vice versa. We report the average number of topics learned for each model with this setting of $\alpha_0$ in Table 2.

We evaluate our model using three tasks: perplexity on held-out data, time-stamp prediction, and qualitative evaluation. The perplexity evaluation was carried out following Zhou et al. (2012b): we hold out 20% of the words from each document, train the model on the remaining 80%, and compute the perplexity of the held-out words (as described in the supplement). We compared the tGaP-PFA model against a static version of the same model (obtained by deterministically setting all $r_k^{n,t} = 1$) and against the beta-negative binomial process (BNBP) model of Zhou et al. (2012b).
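As an illustration of the hold-out protocol, the sketch below splits a document-by-word count matrix so that roughly 20% of the tokens of each document are held out. The function name `split_counts` and the choice to draw held-out tokens without replacement from each document's bag of words are assumptions made for this example; the text only specifies the 80/20 split.

```python
import numpy as np


def split_counts(counts, holdout_frac=0.2, seed=0):
    """Split a (documents x vocabulary) count matrix into training and
    held-out counts, holding out a fraction of the tokens of each document."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=np.int64)
    train = np.zeros_like(counts)
    held = np.zeros_like(counts)
    for i, row in enumerate(counts):
        n_held = int(round(holdout_frac * int(row.sum())))
        # Draw n_held tokens without replacement from the document.
        held[i] = rng.multivariate_hypergeometric(row, n_held)
        train[i] = row - held[i]
    return train, held
```

The training matrix is then passed to the sampler, and the held-out matrix plays the role of the unobserved counts in the perplexity estimate.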

The perplexity results are presented in Table 2. We see that the tGaP-PFA model obtains lower (better) perplexity than the static version, showing that incorporating dependency can improve performance when the data cannot be assumed exchangeable. The static (stationary) version of the tGaP-PFA model is a much simpler model than the BNBP topic model, which unsurprisingly performs better. However, since the BNBP model is based on a stationary CRM, our results suggest that a dynamic version of the BNBP topic model, constructed with a thinned beta process, could achieve better performance than the stationary model.

Table 2: Average perplexity and average # of topics for State of the Union corpus over 5 hold-out sets.

|       | Static      | Dynamic     | BNBP        |
|-------|-------------|-------------|-------------|
| Perp. | 624.4 ± 2.0 | 528.6 ± 2.7 | 418.1 ± 0.9 |
| E[K]  | 5.4 ± 0.3   | 62.6 ± 0.8  | 198.4 ± 0.4 |

Table 3: Predicting the decade of documents, reported as absolute (L1) error in decades and accuracy (average over 5 hold-out sets).

|      | Static      | Dynamic     | Baseline    |
|------|-------------|-------------|-------------|
| L1   | 6.86 ± 0.12 | 2.42 ± 0.02 | 6.97 ± 0.00 |
| Acc. | 0.05 ± 0.00 | 0.21 ± 0.01 | 0.05 ± 0.00 |

Figure 2: Activation functions over time for three topics (horizontal axis: year, 1800 to 2000).

Table 4: Learned topics.

| Topic 1          | Topic 2          | Topic 3          |
|------------------|------------------|------------------|
| military (0.074) | soviet (0.042)   | tribes (0.142)   |
| defense (0.061)  | nations (0.035)  | indian (0.124)   |
| war (0.056)      | security (0.022) | indians (0.116)  |
| forces (0.051)   | peace (0.021)    | frontier (0.034) |
| force (0.041)    | nuclear (0.020)  | greater (0.027)  |

We also evaluate the ability of our dynamic model to predict the decade of a held-out document. We hold out 20% of the documents in each decade and train the model on the remaining 80%. To predict the decade of a held-out document, we find the decade that maximizes the predictive likelihood of the document. We compare the dynamic model with a static version, in which we train a separate static tGaP-PFA model at each timestamp and predict the decade of a held-out document by choosing the decade with the maximum predictive likelihood. We also compare with a baseline that selects a decade uniformly at random. The results are presented in Table 3, where we see that the dynamic model obtains a ~65% reduction in absolute (L1) error and a ~4x increase in accuracy over the static and baseline models. Interestingly, the static model performs on par with the baseline, indicating substantial over-fitting and demonstrating the necessity of taking time into account.
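The decade-prediction rule and the two reported metrics can be sketched as below. The matrix `log_lik` of predictive log-likelihoods (one row per held-out document, one column per decade) is a hypothetical input that would come from the fitted model; the function names are likewise illustrative.

```python
import numpy as np


def predict_decades(log_lik):
    """Pick, for each held-out document, the decade index with the largest
    predictive log-likelihood. log_lik has shape (documents, decades)."""
    return np.argmax(np.asarray(log_lik), axis=1)


def l1_error_and_accuracy(predicted, true):
    """Mean absolute error in decades and fraction of exact matches."""
    predicted = np.asarray(predicted)
    true = np.asarray(true)
    l1 = float(np.mean(np.abs(predicted - true)))
    acc = float(np.mean(predicted == true))
    return l1, acc


# Relative improvements implied by the Table 3 means.
l1_reduction = 1.0 - 2.42 / 6.86   # about 0.65 versus the static model
accuracy_gain = 0.21 / 0.05        # about 4.2 versus static and baseline
```

The last two lines reproduce the "~65%" and "~4x" figures quoted in the text from the table entries.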

In Figure 2 we depict the activation functions (the mean of $r_k^{n,t}$) over time for three topics, and in Table 4 we show the top 5 words in each topic and the probability of each word under the topic. Topic 1 is a topic on war, and we see it peak at most major conflicts that the United States was involved in. Topic 2 is about the Cold War and peaks at its beginning and end. Topic 3 regards Native Americans; it is very prominent in addresses up to the 1850s and becomes less active in later addresses. Figure 2 shows that the tGaP-PFA topic model is able to uncover multi-modal topic activations as well as to localize topic usage in time.

7 Discussion 

We have presented a framework for dependent random measures that can be used as priors for a large class of nonparametric Bayesian models. Unlike previous work, our construction is applicable to any CRM and has the added benefit that the resultant dependent CRMs retain any existing conjugacy. We showed that many dependent random measures in the literature can be seen as specific cases of the thinned CRM framework. We demonstrated the effectiveness of the framework by using it to create a non-exchangeable latent feature model and a time-varying topic model, both of which achieved superior predictive performance to their exchangeable versions.
A Time-varying topic model

Recall that $w_{pnt}$ represents the number of occurrences of word $p$ in the $n$th document at time $t$, and that we decompose this as $w_{pnt} = \sum_{k=1}^{\infty} w_{pntk}$, where $w_{pntk}$ is the number of occurrences attributed to topic $k$. In the generative process presented below, $p$ indexes the vocabulary, $t$ indexes the observed times of documents, $n$ indexes the documents at a time $t$ and takes values in $\{1, \dots, N_t\}$, and $k$ indexes the topics. Additionally, $l$ indexes the kernel functions of the RVM (Tipping, 2001) with centers $m_l$, which we take to be the locations of the observations (although this is not necessary).

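The generative process is as follows:

$$\Gamma := \sum_{k=1}^{\infty} \pi_k \delta_{(x_k, \theta_k)} \sim \mathrm{CRM}\big(\nu_{G0}(d\pi)\, H(dx)\, G_0(d\theta)\big) \tag{5}$$

where $x_k := (\omega_{0k}, \dots, \omega_{Lk}, \phi_k)$; $\nu_{G0}(d\pi) = \pi^{-1}\exp(-\pi)\,d\pi$ is the Lévy measure of the gamma process with parameters (1,1); $G_0(d\theta)$ is the $P$-dimensional Dirichlet distribution with parameter $\alpha_\theta$; and $H(dx) = H_\phi(d\phi) \prod_{l=0}^{L} H_\omega(d\omega_l)$, where $H_\phi(d\phi)$ is the categorical distribution over the dictionary of kernel widths and $H_\omega(d\omega_l) = \mathrm{NiG}(0, c_0, d_0)$ is the normal-inverse gamma distribution. The rest of the model is

$$p_{x_k}(t) = \Phi\Big(\omega_{0k} + \sum_{l=1}^{L} \omega_{lk} \exp\big(-\phi_k \|t - m_l\|_2^2\big)\Big) \tag{6}$$

$$r_k^{n,t} \sim \mathrm{Ber}\big(p_{x_k}(t)\big) \tag{7}$$

$$G^{n,t} := \sum_{k=1}^{\infty} r_k^{n,t} \pi_k \delta_{\theta_k} \tag{8}$$

$$\beta_k^{n,t} \sim \mathrm{Ga}(\epsilon, 1), \quad n = 1, \dots, N_t, \ k \in \mathbb{N} \tag{9}$$

$$w_{pntk} \sim \mathrm{Pois}\big(\theta_{kp}\, r_k^{n,t}\, \pi_k\, \beta_k^{n,t}\big) \tag{10}$$

$$w_{pnt} = \sum_{k=1}^{\infty} w_{pntk} \sim \mathrm{Pois}\Big(\sum_{k=1}^{\infty} \theta_{kp}\, r_k^{n,t}\, \pi_k\, \beta_k^{n,t}\Big) \tag{11}$$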
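To make Eqs. (5)-(11) concrete, the following is a minimal forward-sampling sketch under stated assumptions: the random measure is truncated to `K` atoms with weights drawn from Ga(1/K, 1) as in the truncated sampler, the normal-inverse gamma prior is read as a Ga(c0, d0) precision (rate parametrization) with a zero-mean normal weight, and the kernel centers are the observed times. The function names, default hyperparameter values and the dictionary `widths` are illustrative and not taken from the paper; the constraint that every document keeps at least one topic is not enforced here.

```python
import math
import numpy as np


def normal_cdf(x):
    """Standard normal CDF, the probit link Phi in Eq. (6)."""
    return 0.5 * (1.0 + np.vectorize(math.erf)(np.asarray(x) / math.sqrt(2.0)))


def sample_tgap_pfa(times, docs_per_time, P, K=20, alpha_theta=0.05, eps=1.0,
                    widths=(0.01, 0.1, 1.0), c0=1.0, d0=1.0, seed=0):
    """Draw word-count vectors from a K-atom truncation of the model."""
    rng = np.random.default_rng(seed)
    times = np.asarray(times, dtype=float)
    centers = times                                   # kernel centers m_l
    L = len(centers)
    pi = rng.gamma(1.0 / K, 1.0, size=K)              # truncated gamma weights
    theta = rng.dirichlet(np.full(P, alpha_theta), size=K)  # topics, (K, P)
    phi = rng.choice(widths, size=K)                  # kernel width per topic
    lam = rng.gamma(c0, 1.0 / d0, size=(K, L + 1))    # precisions (assumed rate d0)
    omega = rng.normal(0.0, 1.0 / np.sqrt(lam))       # RVM weights, (K, L + 1)
    docs = []
    for t, n_docs in zip(times, docs_per_time):
        kern = np.exp(-phi[:, None] * (t - centers[None, :]) ** 2)   # (K, L)
        p_keep = normal_cdf(omega[:, 0] + np.sum(omega[:, 1:] * kern, axis=1))  # Eq. (6)
        for _ in range(n_docs):
            r = rng.binomial(1, p_keep)               # thinning indicators, Eq. (7)
            beta = rng.gamma(eps, 1.0, size=K)        # per-document rates, Eq. (9)
            rate = theta.T @ (r * pi * beta)          # Poisson rate per word, Eq. (11)
            docs.append((t, rng.poisson(rate)))
    return docs
```

Each returned pair holds a time stamp and the vector $(w_{1nt}, \dots, w_{Pnt})$ for one document; sampling the marginal counts directly uses the additivity of independent Poisson variables in Eq. (11).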



B Gibbs sampler 

We use a truncated version of the model by fixing the number of atoms we will represent to $K$ and forming the (finite) random measure $\Gamma_K := \sum_{k=1}^{K} \pi_k \delta_{(x_k, \theta_k)}$, where $\pi_k \sim \mathrm{Ga}(1/K, 1)$, $x_k := (\omega_{0k}, \dots, \omega_{Lk}, \phi_k)$, $\omega_{lk} \sim \mathrm{NiG}(0, c_0, d_0)$, and $\phi_k$ is drawn uniformly from $\{\phi^*_1, \dots, \phi^*_M\}$. In the limit $K \to \infty$, $\Gamma_K \to \Gamma$ in distribution. This truncation allows for the derivation of a straightforward Gibbs sampler. We assume $T$ is the set of unique observed times.
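One way to see why Ga(1/K, 1) weights are a natural truncation: the sum of $K$ independent Ga(1/K, 1) variables is Ga(1, 1) for every $K$, which matches the total mass of a gamma process with parameters (1,1). The short simulation below checks this numerically; the function name and the number of draws are arbitrary choices for illustration.

```python
import numpy as np


def truncated_total_mass(K, n_draws=20000, seed=0):
    """Simulate the total mass sum_k pi_k of the K-atom truncation,
    with pi_k ~ Ga(1/K, 1) independently."""
    rng = np.random.default_rng(seed)
    pi = rng.gamma(shape=1.0 / K, scale=1.0, size=(n_draws, K))
    return pi.sum(axis=1)


for K in (5, 50):
    mass = truncated_total_mass(K)
    # A Ga(1, 1) variable has mean 1 and variance 1, whatever K is.
    print(K, round(float(mass.mean()), 2), round(float(mass.var()), 2))
```

As $K$ grows, most individual weights become very small while the total mass keeps the same distribution.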

We sample each of the variables in turn from their full conditional distributions. We use a standard data-augmentation technique for probit regression to sample the $\omega_{lk}$ variables by introducing an auxiliary variable $z_k^{n,t} \sim N(\Phi^{-1}(p_{x_k}(t)), 1)$, that is, a unit-variance normal centered at the argument of $\Phi$ in Eq. (6), for each topic $k$ at each document $n$ at time $t$, such that

$$r_k^{n,t} = \begin{cases} 1 & \text{if } z_k^{n,t} > 0 \\ 0 & \text{otherwise.} \end{cases}$$

See Albert & Chib (1993) for details of the data augmentation. The conditional distributions are as follows.

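• Topics, $\theta_k$.

$$\theta_k \mid \dots \sim \mathrm{Dir}(\alpha_\theta + w_{1\cdot\cdot k}, \dots, \alpha_\theta + w_{P\cdot\cdot k}) \tag{12}$$

where $w_{p\cdot\cdot k} = \sum_{t \in T} \sum_{n=1}^{N_t} w_{pntk}$.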

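• Global topic proportions, $\pi_k$.

$$\pi_k \mid \dots \sim \mathrm{Ga}\Big(w_{\cdot\cdot\cdot k} + 1/K,\ \sum_{t \in T} \sum_{n=1}^{N_t} r_k^{n,t} \beta_k^{n,t} + 1\Big) \tag{13}$$

where $w_{\cdot\cdot\cdot k} = \sum_{p=1}^{P} \sum_{t \in T} \sum_{n=1}^{N_t} w_{pntk}$.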

• Per-topic counts, $w_{pntk}$.

$$(w_{pnt1}, \dots, w_{pntK}) \mid \dots \sim \mathrm{Mult}(w_{pnt};\ \zeta_{pnt1}, \dots, \zeta_{pntK}), \qquad \zeta_{pntk} = \frac{\theta_{pk}\, r_k^{n,t}\, \pi_k\, \beta_k^{n,t}}{\sum_{j=1}^{K} \theta_{pj}\, r_j^{n,t}\, \pi_j\, \beta_j^{n,t}} \tag{14}$$

and we ensure that the denominator is greater than 0 by making sure that, when sampling the $r_k^{n,t}$, no document thins every topic, i.e. $\forall t\ \forall n\ \exists j:\ r_j^{n,t} = 1$.

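• Per-document topic rate, $\beta_k^{n,t}$.

$$\beta_k^{n,t} \mid \dots \sim \mathrm{Ga}\big(w_{\cdot ntk} + \epsilon,\ r_k^{n,t} \pi_k + 1\big) \tag{15}$$

where $w_{\cdot ntk} = \sum_{p=1}^{P} w_{pntk}$.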

• Time-dependent indicators, $r_k^{n,t}$: There are three cases:

1. $\forall j \neq k,\ r_j^{n,t} = 0 \Rightarrow r_k^{n,t} = 1$

2. $\exists p,\ w_{pntk} > 0 \Rightarrow r_k^{n,t} = 1$

3. $\forall p,\ w_{pntk} = 0$

Cases 1 and 2 are deterministic. For case 3 let $u_{pntk} \sim \mathrm{Pois}(\rho_p)$ with $\rho_p = \theta_{pk} \pi_k \beta_k^{n,t}$ denote the fictitious count of word $p$ in the $n$th document at time $t$ assigned to topic $k$, disregarding $r_k^{n,t}$. The $u_{pntk}$ allow us to determine whether $w_{pntk} = 0$ because the topic has been thinned or because the topic is not popular (globally or for the individual document). Case 3 above then splits into the following cases:

1. $\forall p,\ u_{pntk} = 0,\ r_k^{n,t} = 1$ with probability $\propto P(r_k^{n,t} = 1) \prod_{p=1}^{P} \mathrm{Pois}(0; \rho_p)$

2. $\exists p,\ u_{pntk} > 0,\ r_k^{n,t} = 0$ with probability $\propto P(r_k^{n,t} = 0) \big(1 - \prod_{p=1}^{P} \mathrm{Pois}(0; \rho_p)\big)$

3. $\forall p,\ u_{pntk} = 0,\ r_k^{n,t} = 0$ with probability $\propto P(r_k^{n,t} = 0) \prod_{p=1}^{P} \mathrm{Pois}(0; \rho_p)$

We evaluate the three probabilities and sample from the resulting discrete distribution.

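• RVM weights, $\omega_{lk}$. We introduce the auxiliary variables $\lambda_{lk}$ such that

$$\omega_{lk} \mid \lambda_{lk} \sim N(0, \lambda_{lk}^{-1}), \qquad \lambda_{lk} \sim \mathrm{Ga}(c_0, d_0).$$

Let $\omega_k = (\omega_{0k}, \dots, \omega_{Lk})^\top$ be the vector of RVM weights and $z_k$ be the vector of augmentation variables for all time stamps, and

$$K_{tk} = \big(1, K(t, m_1, \phi_k), \dots, K(t, m_L, \phi_k)\big)^\top \tag{16}$$

be the vector of evaluations of the RVM kernels at time $t$. Then the conditional of $\omega_k$ is given by

$$\omega_k \mid z_k, \dots \sim N(\xi, B) \tag{17}$$

where $B = \big(\mathrm{diag}(\lambda_{0k}, \dots, \lambda_{Lk}) + K_k^\top K_k\big)^{-1}$ and $\xi = B K_k^\top z_k$, with $K_k$ the matrix whose rows are the vectors $K_{tk}^\top$, one for each entry of $z_k$.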

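• RVM auxiliary variables, $z_k^{n,t}$.

$$z_k^{n,t} \mid \dots \sim \begin{cases} N(\mu_{tk}, 1)\, \mathbb{1}(z_k^{n,t} > 0) & \text{if } r_k^{n,t} = 1 \\ N(\mu_{tk}, 1)\, \mathbb{1}(z_k^{n,t} \le 0) & \text{if } r_k^{n,t} = 0 \end{cases} \tag{18}$$

where $\mu_{tk} = K_{tk}^\top \omega_k$. This is a truncated normal distribution that we sample using the inversion method described in Albert & Chib (1993).

• RVM precisions, $\lambda_{lk}$.

$$\lambda_{lk} \mid \dots \sim \mathrm{Ga}\Big(c_0 + \frac{1}{2},\ d_0 + \frac{\omega_{lk}^2}{2}\Big) \tag{19}$$

• RVM kernel widths, $\phi_k$. We assume a finite dictionary $\{\phi^*_1, \dots, \phi^*_M\}$ of possible values for the RVM kernel widths, and a uniform prior on these values,

$$P(\phi_k = \phi^*_m \mid \dots) \propto \frac{1}{M} \prod_{t \in T} \prod_{n=1}^{N_t} p_{\phi^*_m}(t)^{r_k^{n,t}} \big(1 - p_{\phi^*_m}(t)\big)^{1 - r_k^{n,t}} \tag{20}$$

where we have denoted the thinning function as a function of $\phi^*_m$, as the other variables are held fixed.

C Perplexity computation

Similarly to Zhou et al. (2012b), given $B$ samples of the model parameters and latent variables we compute a Monte Carlo estimate of the held-out perplexity for unobserved counts $Y = [y_{pnt}]$ as

$$\exp\left(-\frac{1}{y_{\cdot\cdot\cdot}} \sum_{p=1}^{P} \sum_{t \in T} \sum_{n=1}^{N_t} y_{pnt} \log \frac{\sum_{b=1}^{B} \sum_{k=1}^{K} \theta_{pk}^{b}\, \pi_k^{b}\, r_k^{b,n,t}\, \beta_k^{b,n,t}}{\sum_{b=1}^{B} \sum_{p'=1}^{P} \sum_{k=1}^{K} \theta_{p'k}^{b}\, \pi_k^{b}\, r_k^{b,n,t}\, \beta_k^{b,n,t}}\right) \tag{21}$$

where we have used a superscript $b$ to denote the $b$th sample of the parameters and latent variables, $y_{pnt}$ denotes the held-out number of occurrences of word $p$ in the $n$th document at time $t$, and $y_{\cdot\cdot\cdot} = \sum_{p=1}^{P} \sum_{t \in T} \sum_{n=1}^{N_t} y_{pnt}$.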
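A Monte Carlo held-out perplexity estimate of the kind used in Eq. (21) can be sketched as follows. The array layout is an assumption made for the example: documents are flattened over $(n, t)$ into a single axis, `theta_samples[b, k, p]` holds the topic-word probabilities of posterior sample $b$, and `weight_samples[b, d, k]` holds the product $r_k \pi_k \beta_k$ for document $d$ in sample $b$. The function name is hypothetical.

```python
import numpy as np


def heldout_perplexity(heldout, theta_samples, weight_samples):
    """Perplexity of held-out counts under sample-averaged Poisson rates.

    heldout:        (D, P) held-out word counts y
    theta_samples:  (B, K, P) topic-word probabilities per sample
    weight_samples: (B, D, K) per-document topic weights r * pi * beta
    """
    heldout = np.asarray(heldout, dtype=float)
    # Sum the word rates over posterior samples b and topics k.
    rates = np.einsum('bdk,bkp->dp', weight_samples, theta_samples)
    # Normalize over the vocabulary to get per-document word probabilities.
    probs = rates / rates.sum(axis=1, keepdims=True)
    mask = heldout > 0                     # skip zero counts to avoid log(0)
    log_lik = np.sum(heldout[mask] * np.log(probs[mask]))
    return float(np.exp(-log_lik / heldout.sum()))
```

As a sanity check, when every sample assigns equal probability to each of the $P$ words, the estimate equals $P$.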



